Time Series Trend (Trendline) Plots
Time series, time series trend, or trend-line plots
are essential to the visualization of temporal data. (I will refer to
them as “time series” plots.) A time series plot shows whether a
quantity increases or decreases over time. As importantly, time series
plots reveal changes in volatility, the sequence of cause and effect,
and the impact of interventions on the time series–insights that are
less obvious but crucial for analysis.
However, time series plots must be carefully constructed because they
easily become cluttered as lines crisscross. Reference and grid lines
add to the clutter. My advice is to limit each plot to three time series
at most; sometimes, even that is too many. Take as an
example the time series plot below. The plot is derived from Besley and
Burgess’s (2000) data on flooding in Indian states. It depicts the
annual flood-affected area for each of four states from 1950 to 1995.
Following Kastellac’s
(2025) advice:
- I use color to distinguish the four time series;
- I attach similarly colored labels to each state's series to avoid a legend, which takes up space and forces the reader's eyes to saccade from data to legend;
- I ensure that the axes are labelled–and that those labels are large enough to be easily legible.
For all that, the plot remains a colorful mess–and it plots only four
time series; there are 12 other states in the data set.

A better option in this case, and in many others, is to employ
facets (also called “panels” or “small multiples”) that
show the same plot for each state in miniature.
My improved time series plot employs a minimalist gray-scale
design that avoids visual distraction. Unlike Schwabish (2021) and
Kastellac (2025), who advocate for use of color, I follow Cleveland’s
advice to use color sparingly; in this re-organized time series plot, it
is unnecessary.
The deliberate use of identical x-axis and y-axis
scales–notwithstanding the varying magnitudes of flooding across
states–is important. This uniformity ensures that the reader only needs
to decode the plot’s structure and meaning once.
The ordering of the data is also critical: I arrange the states
from least (top-left) to most (bottom-right) flooding. This avoids the
arbitrariness of alphabetical ordering. (This was actually a chore to do
in Stata as my code at the end of the chapter shows.)
I duplicate the y-axis on the right to provide the reader with a
visual anchor, given that the graph spans four facets. Rather than
replicating y- and x-axis labels for every facet–which would introduce
redundant clutter–the second y-axis offers a cleaner and more effective
way to facilitate horizontal comparisons. With fewer facets (e.g., two
wide), this second axis would be unnecessary. I also adjusted the aspect
ratio to slightly compress the width, further enhancing horizontal
comparisons.
Moving Averages to Smooth Time Series Data
Highly variable time series data can obscure trends, making it
difficult to draw meaningful conclusions. For example, we might wonder
whether voter turnout has declined in U.S. elections over time. By
applying smoothing techniques, we can address this issue and uncover any
underlying patterns that may be hidden by erratic fluctuations.

Implementing a Moving Average
Implementing a moving average involves five conceptual steps:
1. Order the data: Arrange observations \(X_t\) sequentially by time, from \(t=1\) to \(t=T\).
2. Define the window: Choose a window or span of \(k\) time periods, where \(k \geq 2\) and \(k \ll T\).
3. Compute the moving average: For the first window, compute the mean of the \(k\) observations from \(X_{t=1}\) to \(X_{t=k}\).
4. Shift the window: Move the window forward by one period.
5. Repeat: Continue computing the mean for subsequent windows.
Practical Example: Programming a Moving Average in Sheets
While many software programs (including R, Stata, and many
spreadsheets) automate this process, it is instructive to compute a
moving average step-by-step. Doing so helps you understand what
your software is doing when you instruct it to smooth a time series by
computing a moving average. We will do this in Sheets because it is
particularly transparent.
Implementing Moving Averages in Sheets
1. Set up the data set:
◦ Cell A1 -> “Year”
◦ Cell A2 -> 1980
◦ Cell A3 -> = A2+1 (Fill down for 20-30 rows)
2. Generate random data:
◦ Cell B1 -> “Y”
◦ Cell B2 -> = (A2 * 0.1) + 15 * RAND()
3. Fix random values: Copy > Paste Special > Values Only to paste the values into Column C to prevent random number changes.
4. Calculate the moving average:
◦ In D1, label the column as “MA(Y)”.
◦ In D5, use the formula: =AVERAGE(C2:C5) and fill down.
5. Create a trend-line chart:
◦ Highlight the data and select the chart icon.
◦ Compare the raw data (Column C) with the moving average (Column D).
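For readers who prefer code to spreadsheets, the same recipe can be sketched in Python. This is an illustration only: the function name `moving_average` and the toy series are my own, not part of the chapter's data.

```python
def moving_average(series, k=4):
    """Trailing moving average: the mean of each window of k consecutive
    observations, ordered by time (the five conceptual steps above).
    With k = 4 this mirrors the spreadsheet formula =AVERAGE(C2:C5)."""
    if k < 2 or k > len(series):
        raise ValueError("window must satisfy 2 <= k <= T")
    # No value is produced until a full window of k observations exists,
    # which is why the spreadsheet formula starts in row 5, not row 2.
    return [sum(series[t - k:t]) / k for t in range(k, len(series) + 1)]

# A short, noisy, upward-trending series
y = [10, 14, 11, 15, 13, 17, 14, 18]
print(moving_average(y))  # [12.5, 13.25, 14.0, 14.75, 15.5]
```

Note that the smoothed series is shorter than the raw series by \(k-1\) periods, exactly as in the spreadsheet.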
Logarithms with Different Bases in R and Stata
In R, use log(x, base) to compute logarithms with any
base. If base is omitted, e is assumed
(log(x) = ln(x)). R also provides shortcuts
for common bases: log2(x) = log(x, 2) and
log10(x) = log(x, 10).
In Stata, log(x) and ln(x) both compute the
natural logarithm (ln(x)), while log10(x)
computes base 10 logarithms. To compute logarithms of other bases, you
must apply the change of base rule (see above). For
example, to compute \(\log_2 x\) in
Stata, type:
gen y = log(x) / log(2)
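The same change-of-base rule can be checked outside Stata; here is a minimal Python sketch (Python's `math.log`, like R's `log`, accepts an optional base argument).

```python
import math

x = 8.0
# Change of base: log2(x) = ln(x) / ln(2) -- the identity behind
# the Stata line "gen y = log(x) / log(2)" above.
log2_via_ln = math.log(x) / math.log(2)
log2_direct = math.log(x, 2)   # explicit base, as in R's log(x, 2)
print(log2_via_ln, log2_direct)  # both approximately 3.0
```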
Lowess (or Loess) Smoothing
A key distinction in data modeling is between
parametric and nonparametric
estimation:
Parametric estimation assumes a specific
functional form for the relationship between variables. For example,
linear regression assumes that the relationship is linear and that the
residuals follow a normal distribution.
The advantage of parametric methods lies in their simplicity and
interpretability: when the assumptions hold, they provide precise and
efficient estimates. However, these assumptions can be restrictive and,
when violated, can produce misleading results.
Nonparametric estimation makes no such
assumptions about the functional form or distribution, allowing for more
flexibility in capturing complex relationships. Nonparametric methods
are more adaptable and can reveal patterns that parametric models might
miss, but they run the risk of over-fitting if not applied
carefully.
Lowess (locally weighted regression) is a
nonparametric technique that fits smooth curves to multivariate data by
allowing the data to “speak for themselves.” It is particularly useful
when:
- Theoretical guidance is limited: If there is no prior knowledge of the functional relationship between the variables, lowess can provide an exploratory look.
- The data are complex: The presence of noise, non-linearity, or outliers makes a flexible approach desirable.
However, the flexibility of lowess comes with a trade-off: while it
is adept at extracting an interpretable signal from noisy data, it is
also prone to over-fitting. Over-fitting occurs when the lowess curve
closely follows random fluctuations in the data, capturing noise rather
than the underlying trend.
Steps for Lowess Smoothing
This series of graphics (based on Cleveland 1993, 95) illustrates the
conceptual process of lowess smoothing. Understanding these steps will
help you appreciate what the software’s lowess algorithm is doing when
it automatically fits the curve to your data.
Step 1: Fix the window size, \(\alpha\)
- Order the data: Sort observations \((x_i,y_i)\) by the independent variable, i.e., from smallest to largest \(x_i\).
- Define intervals: Divide the range of \(x\) into intervals (windows) centered on evaluation points \(v_j\).
A “tension” parameter, \(\alpha\), controls the width of these
windows. When \(\alpha = 1\), the window’s width is set such that it
contains 100 percent of the data; when \(\alpha = .5\), the window’s
width is set such that it contains 50 percent of the data, and so on.

Step 2: Weight the data
- Weight the data points in the window using a kernel, just as in KDE.
As with KDE, a variety of kernels are available–but the width of the
window tends to be the more important parameter.

Step 3: Estimate local regression
- Estimate local regression: Regress \(y\) on \(x\) within each window. A control parameter, \(\lambda\), sets the degree of the polynomial in \(x\) in these regressions.
- Predict values: Use the regression results to predict \(y\) values at the evaluation points. (The illustrated best-fit line may strike you as too flat given the data–but recall that the data are weighted.)

Step 4: Slide the window one observation and repeat
The window is then moved to the right by one observation, and the
next local regression is estimated. This is repeated until the window
reaches the right edge of the data.
- Connect the points: Join the predicted values to form a smooth curve.
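The steps above can be condensed into a short sketch. This is an illustrative pure-Python implementation under simplifying assumptions (tricube kernel, \(\lambda = 1\), nearest-neighbour windows); it is not the algorithm your software runs, which handles ties, robustness iterations, and edge effects more carefully.

```python
def lowess_sketch(xs, ys, alpha=0.5):
    """Steps 1-4 above: sort by x; for each evaluation point take the
    window holding a fraction alpha of the data; weight it with a
    tricube kernel; fit a weighted linear regression (lambda = 1);
    predict; slide the window and repeat."""
    n = len(xs)
    k = max(2, round(alpha * n))                 # window size from alpha
    pairs = sorted(zip(xs, ys))                  # Step 1: order the data
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    fitted = []
    for j in range(n):
        # Step 2: the k nearest neighbours of x_j form the window;
        # weight them with a tricube kernel (weight 1 at x_j, 0 at edge)
        idx = sorted(range(n), key=lambda i: abs(xs[i] - xs[j]))[:k]
        h = max(abs(xs[i] - xs[j]) for i in idx) or 1.0
        w = {i: (1 - (abs(xs[i] - xs[j]) / h) ** 3) ** 3 for i in idx}
        # Step 3: weighted least squares for y = b0 + b1*x in the window
        sw = sum(w.values())
        mx = sum(w[i] * xs[i] for i in idx) / sw
        my = sum(w[i] * ys[i] for i in idx) / sw
        sxx = sum(w[i] * (xs[i] - mx) ** 2 for i in idx)
        sxy = sum(w[i] * (xs[i] - mx) * (ys[i] - my) for i in idx)
        b1 = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + b1 * (xs[j] - mx))    # predict at x_j
    return xs, fitted                            # join to form the curve
```

On exactly linear data the sketch reproduces the line; on noisy data, increasing `alpha` yields a smoother curve, which is the bias-variance trade-off discussed next.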

Setting Control Parameters
\(\alpha\) (window
width): Controls the bias-variance trade-off by determining the
width of the window. A small \(\alpha\)
risks over-fitting by capturing noise, while a large \(\alpha\) may smooth the data so much that
it obscures important features. We have encountered this sort of
trade-off before, with both histograms and kernel density
estimates.
\(\lambda\) (degree of
polynomial): Determines the functional form of the local
regression. For example:
- When \(\lambda = 1\), the local
regression fits a linear model of the form \(y
= \beta_0 + \beta_1x\), which assumes a straight-line
relationship between \(x\) and \(y\).
- When \(\lambda = 2\), the local
regression fits a quadratic model of the form \(y = \beta_0 + \beta_1x + \beta_2x^2\),
which allows for curvature, capturing relationships where \(y\) may increase and then decrease or vice
versa.
Practical Implementation
- Set \(\lambda\) by
inspection:
If the data cloud is monotonic, that is, it rises or
falls uniformly across the range of \(x\), use \(\lambda = 1\).
If the data cloud is non-monotonic, that is, it rises
and then falls (or vice versa) across the range of
\(x\), then use \(\lambda = 2\).
You should almost never need \(\lambda
> 2\).
- Determine \(\alpha\)
iteratively: Start with a small value and gradually increase
until the lowess curve is appropriately smooth. If the lowess curve is
almost a straight line, reduce \(\alpha\) slightly: at that point,
lowess provides little more insight than a linear regression of \(y\) on \(x\).
Evaluating the Lowess Model
- Residual analysis: Residuals are the differences between the
observed values and the values predicted by the lowess curve. Plot the
residuals against the independent variable. A well-fitted lowess curve
captures all systematic variation in the data, leaving only random
noise, so the residual plot should show no systematic pattern. If the
residuals do show a pattern, such as a trend or clustering, the lowess
curve has not fully captured the underlying relationship between the
variables; this may indicate a need to adjust the bandwidth \(\alpha\)
or the degree of the polynomial \(\lambda\). The goal is residuals
distributed randomly around zero, with no discernible structure,
indicating that the model has removed all predictable components.
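The logic of the residual check can be made concrete with a small sketch (illustrative Python; the helper `residual_check` and the numbers are my own). It computes residuals as observed minus fitted values and then, as a crude test for leftover structure, regresses the residuals on \(x\); a mean and slope near zero are consistent with a well-fitted curve, though a flat slope alone cannot rule out non-linear leftover patterns.

```python
def residual_check(xs, ys, fitted):
    """Residuals = observed - fitted. Returns (mean residual, slope of
    residuals regressed on x); both should be near zero for a
    well-fitted smoother."""
    res = [y - f for y, f in zip(ys, fitted)]
    n = len(xs)
    mx = sum(xs) / n
    mr = sum(res) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (r - mr) for x, r in zip(xs, res)) / sxx
    return mr, slope

# A perfect fit leaves zero-mean, trendless residuals
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
mean_r, slope_r = residual_check(xs, ys, fitted=[2.0, 4.0, 6.0, 8.0])
print(mean_r, slope_r)  # 0.0 0.0
```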
Practical Example: Fitting and Diagnosing Loess
This example uses geom_smooth and
stat_smooth in R’s ggplot2 to fit a loess
curve to the Kostelka and Blais data, and then to diagnose the fit. Both
geom_smooth and stat_smooth provide options to
set bandwidth and polynomial degree. I provide parallel Stata code at
the end of the chapter.
Step 1: Choose a bandwidth and estimate the loess function
I begin by setting the bandwidth of the loess curve to .4 (`span = .4`).
# A tibble: 6 × 46
country year EL_TYPE PLT_dem0_time Vdem_v2elsrgel Vdem_v2ellocelc
<dbl+lbl> <dbl> <dbl+lbl> <dbl> <dbl+lbl> <dbl+lbl>
1 1 [Albania] 2005 0 [Legislati… 16 5 [Executive … 5 [Executive a…
2 1 [Albania] 2009 0 [Legislati… 20 5 [Executive … 5 [Executive a…
3 1 [Albania] 2013 0 [Legislati… 24 5 [Executive … 5 [Executive a…
4 1 [Albania] 2017 0 [Legislati… 28 5 [Executive … 5 [Executive a…
5 2 [Argentina] 1973 0 [Legislati… 1 5 [Executive … 5 [Executive a…
6 2 [Argentina] 1973 1 [President… 1 5 [Executive … 5 [Executive a…
# ℹ 40 more variables: Vdem_v2elparlel <dbl+lbl>, Vdem_v2ex_elechos <dbl>,
# Vdem_v2lgbicam <dbl+lbl>, Vdem_v2ddyrpl_1 <dbl>, Vdem_v2ddyrpl_2 <dbl>,
# Vdem_v2ddyrpl_3 <dbl>, Vdem_v2ddyrpl_4 <dbl>, Vdem_v2ddyrrf_1 <dbl>,
# Vdem_v2ddyrrf_2 <dbl>, Vdem_v2ddyrrf_3 <dbl>, Vdem_v2ddyrrf_4 <dbl>,
# Vdem_v2ddyror_1 <dbl>, Vdem_v2ddyror_2 <dbl>, Vdem_v2ddyror_3 <dbl>,
# Vdem_v2ddyror_4 <dbl>, Vdem_v2ddyrci_1 <dbl>, Vdem_v2ddyrci_2 <dbl>,
# Vdem_v2ddyrci_3 <dbl>, Vdem_v2ddyrci_4 <dbl>, WB_v_513 <dbl>, …

Step 2: Save and diagnose residuals
The next step is to diagnose the data using R’s loess
function to generate predicted values and residuals. I then fit a second
loess curve through the scatterplot of the residuals graphed against
logged electorate size.
Recall that, in theory, the loess curve that we plot through that
residual data cloud should be flat, indicating that the residuals are
uncorrelated to logged electorate size. If that is not the
case, it would suggest that our initial bandwidth was too large to fully
capture the relationship between turnout and logged electorate
size.
The residual loess curve fluctuates within narrow bounds around 0. The
curve does not systematically rise or fall. This evidence suggests that
the residuals are essentially uncorrelated with logged electorate size,
and in turn, that the original choice of a bandwidth of .4 was
sufficient to capture the relationship between turnout and logged
electorate size. One could repeat this exercise using a slightly larger
bandwidth in the original plot–but this evidence indicates that a
bandwidth of .4 is defensible.
Quantile-Quantile Plots
A quantile-quantile (Q–Q) plot compares quantiles of one variable (on
one axis) with those of another (on the other axis). Points close to the
reference line suggest similar shapes and scales for the two
distributions, whereas deviations indicate differences in distribution
or scale. One might think overlaid kernel density (KDE) plots of the two
distributions would serve much the same purpose. That is not quite so.
First, a KDE plot shows the probability density function (PDF) of a
variable; a quantile plot shows the cumulative distribution function (CDF).
The two density functions are mathematically related, of course (the PDF
is the derivative of the CDF), but for all but the simplest
distributions it is difficult for humans to conceive of a distribution’s
CDF solely on the basis of a visual depiction of its PDF, and vice
versa. Thus, we should see Q-Q plots as complementary to overlaid KDEs.
Second, the KDE purposely interpolates between gaps in the data to
generate a smooth curve. Quantile plots do not “polish” away gaps in the
data in this way–and hence are more directly informative about the
frailties of the underlying data. The same can be said for points in the
distribution where the data bunch up. For both reasons, we may want to
use Q–Q plots to compare the distributions of two variables.
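Mechanically, a Q–Q plot just pairs matching quantiles of the two variables. A small Python sketch (the helper `qq_points` and the chosen percentiles are my own) shows the computation; plotting software does the same thing, usually at every data point rather than a handful of percentiles.

```python
import statistics

def qq_points(a, b, probs=(0.10, 0.25, 0.50, 0.75, 0.90)):
    """Pair the p-th percentile of sample a with the p-th percentile
    of sample b; the resulting (x, y) pairs are the Q-Q plot."""
    # statistics.quantiles(..., n=100) returns the 1st-99th percentiles
    qa = statistics.quantiles(a, n=100, method="inclusive")
    qb = statistics.quantiles(b, n=100, method="inclusive")
    return [(qa[round(p * 100) - 1], qb[round(p * 100) - 1]) for p in probs]

# Identically distributed samples put every point on the 45-degree line
a = [float(v) for v in range(1, 101)]
pts = qq_points(a, a)
```

Deviations from the 45-degree line then localize where the two distributions differ in shape or scale.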
Q–Q Plots for Index Evaluation
A useful variation on the standard quantile–quantile plot is to graph
the quantiles of one variable (e.g., an index or specific indicator)
against the quantiles of another (e.g., a benchmark index or dimension).
This approach can help reveal:
- Relationships among indicators or dimensions of an
index
- Alignment between an index and a benchmark
- Potential shortcomings in how the index tracks an
underlying concept
Ideally, an index should increase smoothly and monotonically as the
latent concept it measures increases. If the Q–Q plot shows “flat
spots”, however, it would suggest one measure is rising smoothly while
the other “stutters”, indicating a mismatch in how the two variables
capture the underlying phenomenon. This issue is common when comparing a
continuous measure to a discrete one.
Here is a political science example. The Q-Q plot below compares the
Polity index’s democracy dimension with V-Dem’s electoral democracy
(polyarchy) index. I multiplied the latter by 10 to align both measures
on the same scale, although this step is not necessary.
Conceptually, a Q-Q plot aligns quantiles of one variable with those
of another. If the two distributions have the same shape and scale, data points
should lie on the 45-degree line, indicating that a one-percentile rise
in one distribution matches a one-percentile rise in the other. We do
not see that here. Instead, small increments in V-Dem’s electoral
democracy index correspond to large jumps in Polity’s democracy
dimension, indicating that Polity differentiates more sharply among
regimes that V-Dem considers relatively homogeneous. At higher scores,
the pattern reverses: Polity draws minimal distinctions among regimes
that V-Dem treats as highly diverse.
This Q-Q plot also highlights the Polity scale’s discrete basis,
which clusters observations at each point on the scale. By contrast,
V-Dem’s measure is more continuous. This plot does not
indicate that one measure is valid and the other invalid; it only shows
that their distributions differ, reflecting distinct conceptual and
operational assumptions. V-Dem’s electoral democracy index provides
finer distinctions among regimes that Polity considers largely
homogeneous, and this is especially evident at the higher end (8+) of
the Polity democracy scale.

For more detail on Q–Q plots and their applications, see the Wikipedia
entry “Q–Q plot”.
Exercise: Lowess Smoothing
- Generate a scatterplot with a lowess curve.
- Evaluate the propriety of the bandwidth using residual plots.
- Provide a brief assessment of the lowess model and submit relevant code and data.
Summary
This chapter highlights the importance of effective data
visualization and transformation in revealing patterns and trends in
bivariate relationships. By understanding how to scale, smooth, and
transform data, you will be better equipped to interpret complex data
sets and derive meaningful insights.
Appendix: Stata and R Code
This Stata code reproduces the “better” time series plot. The nuance
in the coding lies in the way that I had to reorder and re-label states
to coax Stata into ordering the facets by flood-affected area.
// A faceted trendline chart to avoid occlusion and clutter!
use "Calamity.dta", clear
* Ordering states by flood affected area
gen mean_fafarea = .
forvalues i = 1/21 {
summ fafarea if state == `i'
replace mean_fafarea = r(mean) if state == `i'
}
sort mean_fafarea year
egen state_order = group(mean_fafarea statenm), label(statenm, replace)
* (1) Replace & with "and" in the state names.
replace statenm = subinstr(statenm, "&", "and", .)
* (2) Create a helper variable that picks the state name for each state_order group.
bysort state_order: gen label_name = statenm[1]
* Get the unique numeric levels of state_order.
levelsof state_order, local(levels)
local first = 1
foreach lvl of local levels {
quietly levelsof label_name if state_order == `lvl', local(mylabel)
display as text "For state_order value `lvl', mylabel: " `"`mylabel'"'
if `first' == 1 {
label define new_lab `lvl' `"`mylabel'"'
local first 0
}
else {
label define new_lab `lvl' `"`mylabel'"', modify
}
}
* Attach the new value label to state_order.
label values state_order new_lab
*The plot itself
twoway ///
(line fafarea year, sort) ///
(line fafarea year, sort yaxis(2) lcolor(none)), ///
ytitle("Flood Affected Area (millions hectares)", size(small)) ///
ylabel(0(4)8, labsize(medsmall)) ymtick(##4) ///
ylabel(0(4)8, labsize(medsmall) axis(2)) ymtick(##4, axis(2)) ///
xtitle("") xlabel(1950(10)2000) xmtick(##2) ///
by(state_order, note("Data Source: Besley & Burgess (2000)", margin(medium) size(small)) ///
title("Flooding in Indian States, 1950-1996", margin(medium)) legend(off)) aspect(.45)
This Stata code reproduces the scatterplots used to illustrate
logarithmic transformations
// do.file to reproduce Ch. 8 scatterplots
//Upload and pre-process data
use "Kostelka & Blais Base.dta", clear
* I want only countries with long time series of 20+ legislative elections
* sort data by county, election type, year
gsort country EL_TYPE year
* drop presidential elections
drop if EL_TYPE==1
*compute N legislative elections per country
by country, sort: gen N_Elections = _N
*keep countries with 20+ elections
keep if N_Elections > 19
* Create some transformations for use later:
* Odds Ratio
gen PRturnout = (Turnout/100)
gen ORturnout = PRturnout/(1-PRturnout)
label var ORturnout "Turnout (Odds Ratio)"
* The logit or log odds of turnout (expressed as a proportion!).
* Stata has a logit function!
gen logitTO = logit(PRturnout)
label var logitTO "Logit of Turnout"
* Log 10 electorate size
gen log10ELECTORATE = log10(electorate_size * 1000000)
label var log10ELECTORATE "Electorate Size (log base 10)"
*Electorate size in log base 2 -- must use base conversion rule
gen log2ELECTORATE = log10ELECTORATE/log10(2)
label var log2ELECTORATE "Electorate Size (log base 2)"
*You can express logits in base 2 too: 1 unit increase => doubling of log(2) odds.
gen logit2TO = logitTO/ln(2)
label var logit2TO "Logit (base 2) of Turnout"
// Panel A: Turnout by electorate size
* Even after rescaling, electorate size is highly right-skewed
* Many observations clustered and occluded in top-left of plot
scatter Turnout electorate_size, msize(medsmall) msymbol(circle) mfcolor(gs12%40) mlcolor(gs4) mlwidth(vthin) jitter(1) xtitle("Electorate Size (millions)", size(vsmall)) ytitle("Turnout (% reg. voters)", size(vsmall)) title("Panel A", size(small) position(11) margin(vsmall)) aspect(1) graphregion(margin(small)) plotregion(margin(small))
*save graph as png. Add path as necessary.
graph export "scatter_transform_1.png", as(png) name("Graph") replace
// Panel B: Turnout by log10(electorate size)
*Rescaling electorate size via log10 transformation eases right skew
scatter Turnout log10ELECTORATE, msize(medsmall) msymbol(circle) mfcolor(gs12%40) mlcolor(gs4) mlwidth(vthin) jitter(1) xtitle("log{sub:10} Electorate Size", size(vsmall)) ytitle("Turnout (% reg. voters)", size(vsmall)) title("Panel B", size(small) position(11) margin(vsmall)) aspect(1) xscale(r(5.5 8.5)) xlabel(5.5(.5)8.5, format(%4.1f)) graphregion(margin(small)) plotregion(margin(small))
*save graph as png
graph export "scatter_transform_2.png", as(png) name("Graph") replace
// Panel C: Turnout/1-Turnout by log10(electorate size)
* Rescaling by turnout odds ratio is not an improvement
* The transformation "squashes" turnout and interpretation is opaque
twoway(scatter ORturnout log10ELECTORATE, msize(medsmall) msymbol(circle) mfcolor(gs12%40) mlcolor(gs4) mlwidth(vthin) jitter(1) xtitle("log{sub:10} Electorate Size", size(vsmall)) ytitle("Turnout (Odds Ratio)", size(vsmall)) title("Panel C", size(small) position(11) margin(vsmall)) aspect(1) xscale(r(5.5 8.5)) xlabel(5.5(.5)8.5, format(%4.1f)) yscale(r(0 40)) ylabel(0(5)40, format(%4.1f)) graphregion(margin(small)) plotregion(margin(small)))
*save graph as png
graph export "scatter_transform_3.png", as(png) name("Graph") replace
//Panel D: Adding Secondary Axes
* This took a LOT of work; it's not intuitive & this code is well worth saving
* See https://www.stata-journal.com/sjpdf.html?articlenum=gr0032
* Must be run as a chunk!
foreach n of numlist 27 50 82 95 99 {
local labely `labely' `= round(logit(`n'/100))' "`n'"
}
foreach m of numlist 1 10 100 {
local labelx `labelx' `= round(log10(`m'*1000000))' "`m'"
}
scatter logitTO log10ELECTORATE, ylabel(`labely') ytitle("Turnout (% reg. voters)", size(vsmall)) msymbol(circle) mfcolor(none) mlcolor(none)||scatter logitTO log10ELECTORATE, msize(medsmall) msymbol(circle) mfcolor(gs12%40) mlcolor(gs4) mlwidth(vthin) jitter(1) yaxis(2) ylabel(-1 0 1.5 3 4.5, axis(2)) ytitle("log{sub:10}{sup:Turnout}/{sub:1-Turnout}", size(vsmall) axis(2)) xaxis(1 2) xtitle("log{sub:10} Electorate Size", size(vsmall) axis(1)) xscale(r(5.5 8.5) axis(1)) xlabel(5.5 (.5) 8.5, format(%4.1f)) xlabel(`labelx', format(%4.2f) axis(2)) xtitle("Electorate Size (millions)", axis(2) size(vsmall)) legend(off) title("Panel D", size(small) position(11) margin(vsmall)) aspect(1) graphregion(margin(small)) plotregion(margin(small))
*save graph as png
graph export "scatter_transform_4.png", as(png) name("Graph") replace
//A panel of Graphs:
graph combine "scatter_transform_1.gph" ///
"scatter_transform_2.gph" ///
"scatter_transform_3.gph" ///
"scatter_transform_4.gph", ///
cols(2) colfirst scale(.8) ///
title("Turnout and Electorate Size", size(medsmall)) ///
note("Source: Kostelka & Blais (2021)" "Legislative elections in countries with 20+ elections in dataset", size(vsmall))
*save graph as png
graph export "table_transformed_scatters.png", as(png) name("Graph") replace
//Final Product
*This is the final product with nice labels and markers
*This combines the best aspects of Panel B and the secondary x-axis of Panel D
*It must be run as a chunk!
foreach m of numlist 1 10 100 {
local labelx `labelx' `= round(log10(`m'*1000000))' "`m'"
}
scatter Turnout log10ELECTORATE, ytitle("Turnout (% reg. voters)", size(small)) msymbol(circle) mfcolor(none) mlcolor(none) xaxis(1 2) xtitle("log{sub:10} Electorate Size", size(small) axis(1)) xscale(r(5.5 8.5) axis(1)) xlabel(5.5 (.5) 8.5, format(%4.1f)) xlabel(`labelx', format(%4.2f) axis(2)) xtitle("Electorate Size (millions)", axis(2) size(vsmall))||scatter Turnout log10ELECTORATE, msize(medsmall) msymbol(circle) mfcolor(gs12%40) mlcolor(gs4) mlwidth(vthin) jitter(1) legend(off) title("Turnout and Electorate Size", size(medsmall)) note("Source: Kostelka & Blais (2021)" "Legislative elections in countries with 20+ elections in dataset", size(vsmall)) aspect(1)
*save graph
graph export "scatter_transform_5.png", as(png) name("Graph") replace
Implementing Lowess in Software
This code chunk shows how to employ Stata’s lpoly
command to estimate loess curves.
// lpoly: degree = degree of polynomial; bwidth = bandwidth (denominated in N observations) in window.
*Ex. 1: The degree of the polynomial matters less than you think
twoway (scatter Turnout Majority if country == 5, sort msize(medsmall) msymbol(circle) mfcolor(gs8%50) mlcolor(black) mlwidth(vthin) msize(medium)) (lpoly Turnout Majority if country == 5, degree(1) bwidth(2) lpattern(solid) lcolor(reddish)) (lpoly Turnout Majority if country == 5, degree(3) bwidth(2) lpattern(longdash) lcolor(black)), xmtick(##10, labsize(small)) yscale(r(75 100)) ylabel(75(5)100) xtitle("Majority %") ytitle("Turnout %") title("Turnout at Austrian legislative elections, 1950-2021") legend(on order(1 "Turnout (%)" 2 "Degree = 1" 3 "Degree = 3") span) note("Data source: Kostelka & Blais (2021)" "Bandwidth = 2", size(vsmall)) aspect(1)
*Ex. 2: Holding degree at 1, what is effect of bandwidth?
twoway (scatter Turnout Majority if country == 5, sort msize(medsmall) msymbol(circle) mfcolor(gs8%50) mlcolor(black) mlwidth(vthin) msize(medium)) (lpoly Turnout Majority if country == 5, degree(1) bwidth(1.25) lpattern(solid) lcolor(reddish)) (lpoly Turnout Majority if country == 5, degree(1) bwidth(2.50) lpattern(longdash) lcolor(black)), xmtick(##10, labsize(small)) yscale(r(75 100)) ylabel(75(5)100) xtitle("Majority %") ytitle("Turnout %") title("Turnout at Austrian legislative elections, 1950-2021") legend(on order(1 "Turnout (%)" 2 "Bandwidth = 1.25" 3 "Bandwidth = 2.50") span) note("Data source: Kostelka & Blais (2021)" "Degree = 1", size(vsmall)) aspect(1)
*Ex. 3: Is a polynomial of degree 1, bandwidth 2.5 defensible?
*estimate the lowess curve and save the predicted values
quietly: lpoly Turnout Majority if country == 5, degree(1) bwidth(2.50) gen(Predicted_Values) at(Majority)
*If y = f(x) + e, then e = y - f(x). Hence, we get residuals by subtracting predicted from observed turnout.
quietly: gen Residuals = Turnout - Predicted_Values
*This interim plot shows how to visualize the residuals
twoway (lpoly Turnout Majority if country == 5, degree(1) bwidth(2.50) lpattern(longdash) lcolor(sky) lwidth(thin)) (scatter Turnout Majority if country == 5, sort msize(medsmall) msymbol(circle) mfcolor(gs8%50) mlcolor(black) mlwidth(vthin) msize(medium)) (scatter Predicted_Values Majority if country == 5, sort msize(medsmall) msymbol(plus) mfcolor(black) mlcolor(black) mlwidth(vthin) msize(medlarge)) , xmtick(##10, labsize(small)) yscale(r(75 100)) ylabel(75(5)100) xtitle("Majority %") ytitle("Turnout %") title("Turnout at Austrian legislative elections, 1950-2021") legend(on order(1 "Lowess (deg: 1; b-width: 2.5)" 2 "Observed Turnout (%)" 3 "Predicted Turnout (%)") span) note("Data source: Kostelka & Blais (2021)", size(vsmall)) aspect(1)
*Ex 4: Residuals vs Majority plot + lowess
*The lowess curve is pretty flat -- so that is OK
*But we have a clump of large residuals -- our model is missing something in the data!?
twoway (lpoly Residuals Majority if country == 5, degree(1) bwidth(2.5) lpattern(solid) lcolor(reddish) lwidth(medthin)) (scatter Residuals Majority if country == 5, mstyle(circle) msize(medsmall)), yscale(r(-6 6)) ylabel(-6(2)6) yline(0, lcolor(gs8)) legend(off) xtitle("Majority (%)") ytitle("Lowess Residuals") title("Lowess residuals graphed" "against winning party's majority" "at Austrian legislative elections.") aspect(1)
*Check the Austria data! Nonparametric approaches do not make misspecification error disappear.
twoway (scatter Turnout Majority if country == 5 & CV >0, msymbol(circle) msize(medium) mcolor(black)) (scatter Turnout Majority if country == 5 & CV ==0, msymbol(circle) msize(medium) mcolor(gs10)), legend(on order(1 "Compulsory" 2 "Voluntary")) note("Data source: Kostelka & Blais (2021)" "Note: Majority status is absolute value of winning " "party's vote share - 50 percent", size(vsmall)) aspect(1) ylabel(75(5)100) title("Turnout and majority status at" "Austrian elections, 1950-2021:") subtitle("Compulsory and voluntary voting.", size(small))
Q-Q Plots in Stata
This code reproduces the Q-Q plot in the text above.
//Load V-Dem data
use "V-Dem-CY-Full+Others-v14.dta", clear
//rescale V-Dem electoral democracy (polyarchy) index
gen PolyX10 = v2x_polyarchy * 10
//generate Q-Q plot
qqplot e_democ PolyX10, mfcolor(none) mlcolor(gs9) mlwidth(vthin) rlopts(lcolor(black) lwidth(vthin) lpattern(dash)) ytitle("Polity - Democracy Score" , size(small)) ylabel(0(2)10, labsize(vsmall)) ymtick(##2) xtitle("V-Dem - Electoral Democracy Index", size(small)) xlabel(0(2)10, labsize(vsmall)) xmtick(##2) title("Benchmarking Polity's Democracy Dimension" "vs V-Dem's Electoral Democracy Index", size(small)) aspect(1)
Log transformations and scatterplots in R
This code produces analogs to the Stata scatterplots of turnout and
logged electorate size.
# Load necessary libraries
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(janitor)
library(haven)
# Load data
Kostelka_Blais_Base <- read_dta("P:/pCloud Drive/Teaching/Teaching 2025/POLI381/Data & Code/Kostelka & Blais - Turnout/Kostelka & Blais Base.dta")
elections_data<-Kostelka_Blais_Base
head(elections_data)
# Process data
# Filter out presidential elections (EL_TYPE ==1)
elections_data <- filter(elections_data, EL_TYPE==0)
# Compute N records per country, filter out those with <20
elections_data <- elections_data %>%
group_by(country = as.character(country)) %>%
mutate(NumRecords = n()) %>%
ungroup() %>%
filter(NumRecords >= 20)
# A basic scatterplot
ggplot(data = elections_data, aes(x=electorate_size, y=Turnout)) +
geom_point()
# Transforming a variable using mutate()
elections_data <- elections_data %>%
mutate(log10_electorate_size = log10(electorate_size*1000000))
# Repeat scatterplot with log10 electorate size
ggplot(data = elections_data, aes(x=log10_electorate_size, y=Turnout)) +
geom_point()
# This computes log2 electorate size
elections_data <- elections_data %>%
mutate(log2_electorate_size = log2(electorate_size*1000000))
# Repeat scatterplot with log2 electorate size
ggplot(data = elections_data, aes(x=log2_electorate_size, y=Turnout)) +
geom_point()
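A side note on the choice of base: log2 and log10 differ only by a constant factor, so the two scatterplots above show the identical cloud of points with different tick values. A quick check of the arithmetic (shown in Python purely for self-containment; the electorate value is illustrative):

```python
import math

# log2 and log10 of any positive number differ by the constant factor
# 1 / log10(2) ~ 3.32, so switching base only relabels the x-axis.
electorate = 8_000_000  # an illustrative electorate of 8 million voters
ratio = math.log2(electorate) / math.log10(electorate)
print(round(ratio, 4))  # the same value for any electorate size
```

Because the transformation is linear in the other base, the choice between log2 and log10 is purely a matter of which tick labels readers find easier to interpret.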
## Making it pretty: Let's improve the scatterplot's aesthetics by:
# - removing ink from the markers to reduce occlusion
# - converting to a white background to improve contrast
# - adding labels to inform readers about what's plotted
## An improved scatterplot
ggplot(data = elections_data, aes(x=log2_electorate_size, y=Turnout)) +
geom_point(shape=1, color="gray60", size=2) +
labs(x=bquote("log"[2]*" Electorate Size"), y="Turnout (%)", title="Turnout & Electorate Size", subtitle="Legislative elections in 16 countries, 1946-2016", caption="Data Source: Kostelka & Blais (2021)") +
coord_fixed(ratio=.1) +
theme_bw()
## Adding a 2nd x-axis
plot2axes <- ggplot(data = elections_data, aes(x=log10_electorate_size, y=Turnout)) +
geom_point(shape=1, color="gray60", size=2) +
labs(x=bquote("log"[10]*" Electorate Size"), y="Turnout (%)", title="Turnout & Electorate Size", subtitle="Legislative elections in 16 countries, 1946-2016", caption="Data Source: Kostelka & Blais (2021)") +
coord_fixed(ratio=.025) +
theme_bw()
plot2axes + scale_x_continuous(
name = bquote("log"[10]*" Electorate Size"),
sec.axis = sec_axis(~10^.,
name = "Electorate size (in millions)",
breaks = c(1e6, 10e6, 100e6), # Set breaks at 1, 10, and 100 million
labels = c("1", "10", "100") # Set labels as 1, 10, and 100 million
)
)
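To see why those breaks work: `sec_axis(~10^., ...)` back-transforms the log10 axis, so the breaks are given in raw electorate counts and land at their log10 positions on the primary axis. A quick check of the arithmetic (Python used only as a self-contained illustration):

```python
import math

# Breaks at 1, 10, and 100 million voters sit at log10 positions 6, 7,
# and 8 on the primary (log10) axis, which is why the secondary-axis
# labels appear evenly spaced.
positions = {m: round(math.log10(m * 1_000_000), 10) for m in (1, 10, 100)}
print(positions)  # {1: 6.0, 10: 7.0, 100: 8.0}
```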
Loess in R
This code replicates the loess plots above.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(janitor)
library(haven)
#STAGE 1: LOAD & PREPROCESS DATA
Kostelka_Blais_Base <- read_dta("P:/pCloud Drive/Teaching/Teaching 2025/POLI381/Data & Code/Kostelka & Blais - Turnout/Kostelka & Blais Base.dta")
elections_data<-Kostelka_Blais_Base
# head(elections_data)
#Filter out presidential elections (EL_TYPE ==1)
elections_data <- filter(elections_data, EL_TYPE==0)
#Compute N records per country, filter out those with <20
elections_data <- elections_data %>%
group_by(country = as.character(country)) %>%
mutate(NumRecords = n()) %>%
ungroup() %>%
filter(NumRecords >= 20)
#STAGE 2: FIT LOESS CURVE TO DATA
#Loess regression - locally weighted linear regression, bandwidth .4
elections_data <- elections_data %>%
mutate(log10_electorate_size = log10(electorate_size*1000000))
ggplot(data = elections_data, aes(x=log10_electorate_size, y=Turnout)) +
geom_point(shape=1, color="gray60", size=2) +
geom_smooth(formula = y ~ x, method = "loess", span = .4, se=TRUE, color="gray20") + # loess curve with bandwidth .4
labs(x=bquote("log"[10]*" Electorate Size"), y="Turnout (%)", title="Turnout & Electorate Size", subtitle="Legislative elections in 16 countries, 1946-2016", caption="Data Source: Kostelka & Blais (2021)") +
coord_fixed(ratio=.025) +
theme_bw()
#STAGE 3: DIAGNOSE LOESS BANDWIDTH
# Fit loess curve; span = .4 matches the plot above, and na.exclude
# keeps predicted values aligned with the rows of elections_data
fit <- loess(Turnout ~ log10_electorate_size, data = elections_data, span = .4, na.action = na.exclude)
# Compute predicted values
predicted <- predict(fit)
# Compute residuals
residuals <- elections_data$Turnout - predicted
#merge above to elections_data
elections_data <- data.frame(elections_data, Predicted_TO = predicted, Residuals_TO = residuals)
#plot Residuals_TO vs log10_electorate_size (computed above)
ggplot(data = elections_data, aes(x=log10_electorate_size, y=Residuals_TO)) +
geom_point(shape=1, color="gray60", size=2) +
geom_smooth(formula = y ~ x,method = "loess", span = .4, se=FALSE) + # loess curve with bandwidth .4, se=TRUE returns CIs
labs(x=bquote("log"[10]*" Electorate Size"), y="Loess Residuals", title="Assessing propriety of bandwidth = .4", caption="Data Source: Kostelka & Blais (2021)") +
coord_fixed(ratio=.025) +
theme_bw()
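For intuition about what geom_smooth(method = "loess") is doing: at each focal value of x, loess fits a weighted regression to the span*n nearest observations, with weights that decline with distance according to the tricube kernel. The sketch below is my own illustrative single-point implementation in Python, not R's loess() (which defaults to local quadratic fits and adds robustness iterations):

```python
import math

# A single-point sketch of loess: at focal point x0, fit a weighted
# linear regression to the span*n nearest neighbours, with tricube
# weights. Illustrative only -- not R's full loess() algorithm.
def local_linear_fit(xs, ys, x0, span=0.4):
    n = len(xs)
    k = max(2, math.ceil(span * n))           # size of the local neighbourhood
    dmax = sorted(abs(x - x0) for x in xs)[k - 1] or 1e-12
    # tricube weights: 0 for points farther than dmax from x0
    w = [(1 - min(abs(x - x0) / dmax, 1.0) ** 3) ** 3 for x in xs]
    # weighted least squares for y = a + b*x
    sw = sum(w)
    swx = sum(wi * x for wi, x in zip(w, xs))
    swy = sum(wi * y for wi, y in zip(w, ys))
    swxx = sum(wi * x * x for wi, x in zip(w, xs))
    swxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    b = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    a = (swy - b * swx) / sw
    return a + b * x0                          # fitted value at x0

# On noiseless linear data the local fit recovers the line exactly.
xs = [i / 10 for i in range(50)]
ys = [2 + 3 * x for x in xs]
print(round(local_linear_fit(xs, ys, 2.5), 6))  # → 9.5
```

The residual plot above probes exactly this machinery: if the span is too large, the local fits over-smooth and the residuals show systematic structure rather than scattering around zero.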
Quantile-Quantile Plots in R
ggplot2 does not readily produce two-sample Q-Q plots, but base R's
qqplot() function does.
#load data
library(haven)
V_Dem_CY_Full_Others_v14 <- read_dta("P:/pCloud Drive/Teaching/Teaching 2025/POLI381/Data & Code/V-DEM/V-Dem-CY-FullOthers-v14_dta/V-Dem-CY-Full+Others-v14.dta")
#adapt range of polyarchy variable
library(dplyr)
V_Dem <- V_Dem_CY_Full_Others_v14 %>%
mutate(PolyX10 = v2x_polyarchy *10)
#replace missing value codes on Polity democracy variable
V_Dem <- V_Dem %>%
mutate(e_democ = ifelse(e_democ < 0, NA, e_democ))
#base R graphics work differently from ggplot:
#1. set a square graph region:
par(pty = "s")
#2. use base R's qqplot()
qqplot(
V_Dem$PolyX10, V_Dem$e_democ,
xlab = "V-Dem Electoral Democracy Index", # X-axis label
ylab = "Polity Democracy Score", # Y-axis label
pch = 21, # Circle with fill
col = "black", # Border (outline) color
bg = "gray90", # Fill color (light gray)
lwd = 1, # Outline width
main = "Benchmarking Polity's Democracy Scale\nvs. V-Dem's Electoral Democracy Index", # Optional main title
font.main = 1, # Use regular font on title
cex.main = 0.8, # Smaller title font too
asp = 1 # Aspect ratio of 1 works to place emphasis on 45-degree line
)
#3. add the diagonal reference line
abline(0, 1, col = "red", lty = 2, lwd = 1.5)
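What qqplot(x, y) computes is, in essence, simple: with equal sample sizes it plots the sorted values of one sample against the sorted values of the other (unequal sizes require interpolating quantiles of the longer sample, omitted here). A minimal sketch in Python with made-up data:

```python
# With equal sample sizes, a two-sample Q-Q plot pairs the i-th smallest
# value of x with the i-th smallest value of y; points near the
# 45-degree line indicate similarly distributed samples.
x = [3, 1, 4, 1, 5, 9, 2, 6]   # made-up data
y = [2, 7, 1, 8, 2, 8, 1, 8]
points = list(zip(sorted(x), sorted(y)))
print(points)  # [(1, 1), (1, 1), (2, 2), (3, 2), (4, 7), (5, 8), (6, 8), (9, 8)]
```

This is why the diagonal reference line added by abline(0, 1) is the natural benchmark: if the two samples came from the same distribution, the paired quantiles would fall along it.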
It was much easier to re-order the facets of the “better” trend-line
plot in ggplot because of forcats’ fct_reorder
function. Getting the minimal theme to look the way I wanted took
some extra code–but it was easy to add because of ggplot’s
modularity.
# Load required libraries
library(ggplot2)
library(dplyr)
library(forcats)
library(haven) # For importing Stata files
# Load the Stata data set
data <- read_dta("P:/pCloud Drive/Teaching/Teaching 2025/POLI381/Data & Code/Indian States Calamity Data/Calamity.dta")
# Step 1: Replace "&" with "and" in state names
data$statenm <- gsub("&", "and", data$statenm)
# Step 2: Calculate mean flood-affected area for each state
state_means <- data %>%
group_by(statenm) %>%
summarize(mean_fafarea = mean(fafarea, na.rm = TRUE)) %>%
ungroup()
# Step 3: Merge mean flood area data back into the main data set
data <- data %>%
left_join(state_means, by = "statenm") %>%
mutate(state_order = fct_reorder(statenm, mean_fafarea))
# Step 4: Create faceted time series plot with fixed y-axis across all facets
ggplot(data, aes(x = year, y = fafarea)) +
geom_line(linewidth=.7) +
facet_wrap(~state_order, scales = "fixed", ncol = 4) + # **Ensures uniform y-axes**
labs(
title = "Flooding in Indian States, 1950-1996",
subtitle = "Data Source: Besley & Burgess (2000)",
x = "",
y = "Flood Affected Area (millions hectares)"
) +
theme_minimal(base_size = 14) + # Clean theme
theme(
strip.text = element_text(size = 8), # Facet labels more readable
axis.text.x = element_text(size = 9, angle = 45, hjust = 1),
axis.text.y = element_text(size = 9),
# **Restore Tick Marks**
axis.ticks = element_line(color = "black", linewidth = 0.5),
# **Major grid lines only**
panel.grid.major = element_line(color = "gray70", linewidth = 0.5),
panel.grid.minor = element_blank(), # Remove minor grid lines
# **Simulate panel border**
panel.background = element_rect(color = "black", fill = NA, linewidth = 0.5),
# **Adjust aspect ratio for better horizontal comparisons**
aspect.ratio = 1/3
) +
# Ensure x-axis spans from 1950 to 1995 with major ticks at 10-year intervals
scale_x_continuous(
limits = c(1950, 1995),
breaks = seq(1950, 1995, by = 10) # Major ticks at 10-year intervals
) +
# Ensure y-axis has major ticks at 3-unit intervals (no minor ticks)
scale_y_continuous(
breaks = seq(floor(min(data$fafarea, na.rm = TRUE)), ceiling(max(data$fafarea, na.rm = TRUE)), by = 3)
)
Social Science Data Archives
Major social science data archives offer a wealth of survey data, opinion polls, and other social indicators. Some of the main social science data archives are:
These archives often include detailed metadata, making them highly useful for structured research projects.